MODEL SPECIFICATION VIA SEQUENTIAL COHERENCE AND BACKWARD INDUCTION

P. RICHARD HAHN 


Abstract. This paper describes how to specify probability models for data analysis
via a backward induction procedure. The new approach yields coherent, prior-free
uncertainty assessment. After presenting some intuition-building examples, the new
approach is applied to a kernel density estimator, which leads to a novel method for
computing point-wise credible intervals in nonparametric density estimation. The new
approach has two additional advantages: 1) the posterior mean density can be
accurately approximated without resorting to Monte Carlo simulation and 2)
concentration bounds are easily established as a function of sample size.


1. Preliminaries 


1.1. Introduction. Among de Finetti’s enduring insights was that observable
quantities should be the central object of subjective probability. In his seminal work
[de Finetti, 1974, 1975], specific likelihoods, and priors over the associated
parameters, arise directly from symmetry considerations concerning future,
yet-to-be-observed, data. In particular, certain forms of exchangeability imply
certain likelihood functions. To note a classic example, the normal distribution
arises by assuming that any n data points have a uniform distribution on the surface
of a sphere with a given center and diameter (for details, see Schervish [1995],
example 2.117).

However, an outstanding limitation of applied Bayesian modeling is a profound
lack of intuition concerning how models behave under misspecification. It is
well-known that misspecified Bayesian models will converge to the so-called
“pseudo-true” posterior [Kleijn and van der Vaart, 2006], the one among the assumed
model class that is nearest in Kullback-Leibler divergence to the actual data
generating process. However, the form of the pseudo-true model depends on features of
the data-generating process that may be unrelated to the desired estimand. This state
of affairs is obviously unsatisfactory when simple, consistent non-Bayesian estimators
may be known to exist. This paper asks whether it may be possible to begin Bayesian
inference with a well-understood estimator and, from that starting point, produce
Bayesian posterior uncertainty statements.

With this goal in mind, we propose to weaken de Finetti’s exchangeability
assumption to a similar condition termed sequential coherence. Interestingly, infinite
sequences are sequentially coherent if and only if they are exchangeable (Theorem
1.1 in Kallenberg [2005]), meaning that making the sequential coherence assumption
for infinite sequences returns one to the setting of de Finetti’s theorems, and no
flexibility has been gained. As such, we consider specifying models for large, but
finite, vectors of future data.

In brief, the new approach to model specification proceeds as follows. Instead of
starting with a likelihood and a prior, one specifies an estimator of the predictive
distribution of the data, based on the observed data as well as future, unobserved,
data. By imposing sequential coherence, this estimator defines a sequence of
predictive distributions, which in turn jointly define a posterior distribution over
any quantity of interest (means, quantiles, correlations, etc.). In this way, one
knows, by explicit construction, the form of the limiting posterior distribution,
irrespective of the true (unknown) data generating mechanism. At the same time,
straightforward sequential simulation yields corresponding Bayesian uncertainty
assessments.

The suitability of the new approach is exemplified via a detailed study of the
problem of univariate density estimation, a relatively simple and well-understood
statistical task that is nonetheless of routine practical importance. Comparisons are
drawn to the earlier quasi-Bayesian kernel density estimation approaches of West
[1991] and Bernardo [1999].


1.2. Sequential coherence. Begin by assuming a sample size sufficiency condition.
For some large N,

i) all $y_j$, for $j > N$, are independent and identically distributed with density
function $p_N(y) = p(y \mid y_{1:N})$ depending only on the sample $y_{1:N}$.


Informally, in a subjective Bayesian learning context, this assumption states that,
having observed a sample of size N, one would feel comfortable treating any
additional observations as independent and identically distributed from the predictive
density $p(y \mid y_{1:N})$.

From the sample size sufficiency assumption, a sequence of predictive distributions
is derived so as to satisfy a sequential coherence condition [Goldstein, 1983, Zabell,
2002, Parmigiani and Inoue, 2009]:

ii) For $p_t(y) = p(y \mid y_{1:t})$,

(1)  $p_t(y) = \int p_{t+1}(y \mid y_{t+1})\, p_t(y_{t+1})\, dy_{t+1},$

for $0 < t < N$.


This condition asserts a certain relationship between subsequent and previous
predictive distributions: informally, my expected predictive density tomorrow is my
predictive density today. Phrased this way, it is clear that this is a martingale
condition. Writing $X_t = p(y \mid y_{1:t})$, sequential coherence can be stated as
the condition that $E(X_{t+1} \mid X_{1:t}) = X_t$. This condition has been called
contractability [Kallenberg, 2005] and also marginalization consistency [West, 1991,
Bernardo, 1999].

With a coherent sequence of predictive distributions in hand, uncertainty intervals
can be calculated via sequential forward simulation, starting from $p_n(y)$, based on
an observed sample $y_{1:n}$, as described in the next subsection. Notably, this
approach to posterior uncertainty makes no explicit mention of a prior distribution,
although one may be implied.

Section 2 describes how to derive a coherent sequence of predictive distributions
by working backward from a specified $p_N(y)$. The working details of this approach
are illustrated via two small examples and compared to the usual Bayesian posterior.
Section 3 applies the method to a kernel density estimator, leading to an efficient
method for producing point-wise credible intervals of an unknown density function.


1.3. Uncertainty assessment via sequential forward simulation. Although
contemporary Bayesian statistics works predominantly with probability models
specified in terms of priors and likelihoods, it is possible to conduct posterior
inference working directly with joint distributions on observables, à la de Finetti
[1974, 1975]. Recall the compositional representation of a joint distribution


(2)  $p(y_{1:n}) = p_0(y_1)\, p_1(y_2 \mid y_1)\, p_2(y_3 \mid y_{1:2}) \cdots p_{n-1}(y_n \mid y_{1:(n-1)}).$


Posterior distributions can be derived from this sequence of predictive distributions,
via forward simulation, as follows. First, with past (observed) data $y_{1:n}$ in hand,
simulate $y^*_{n+1}$ from $p_n(y \mid y_{1:n})$. Then simulate $y^*_{n+2}$ from
$p_{n+1}(y \mid y_{1:n}, y^*_{n+1})$, and then $y^*_{n+3}$ from
$p_{n+2}(y \mid y_{1:n}, y^*_{(n+1):(n+2)})$, etc. Continue this process, sequentially
simulating a total of m hypothetical future observations, arriving finally at the
distribution

(3)  $p_N(y \mid y_{1:n}, y^*_{(n+1):N}),$

where N = n + m. From this distant-future predictive distribution, extract any
summary of interest from $p_N(y \mid y_{1:n}, y^*_{(n+1):N})$; call it
$\theta = g[p_N]$. Typical choices for $g[\cdot]$ might be a mean, a quantile, a high
density region or even the entire density function. Repeating this process, one
performs a Monte Carlo integration over hypothetical future data realizations, each
$\theta^{(j)}$ denoting property $g[\cdot]$ of a different m-step ahead posterior
predictive distribution, corresponding to the jth simulated realization of future data
$y^*_{(n+1):N}$. The distant-future quantity $\theta$ is uncertain precisely because
many different future realizations are possible.

Taking $N \to \infty$ makes the connection with the usual approach. A model parameter
$\theta$ can be thought of as a functional $g[\cdot]$ of the posterior predictive
distribution $p_N(y) = p(y \mid y_{1:N})$ as $N \to \infty$ so that

(4)  $\theta = g[p_\infty(y)].$

That is, supposing that $p(y_1, \dots, y_\infty)$ is stipulated, $\theta$ simply picks
off some feature of the conditional distribution of one element, given an infinite
amount of past data.

Figure 1. Gray lines depict 20 simulated data sequences from the
prior predictive; they terminate 1000 steps in the future at points that
are uniformly distributed in the interval. Solid lines show 20 simulated
data sequences beginning from the point n = 10 with an observed sample
average of 0.7; restricting to sequences that run through the point
(10, 0.7) yields sample paths that terminate in a more concentrated
region.


Example: Bernoulli likelihood. Suppose $Y_t \sim \mathrm{Bernoulli}(\theta)$ with
prior $\theta \sim \mathrm{Beta}(\alpha_0, \beta_0)$, which is the uniform prior when
$\alpha_0 = \beta_0 = 1$. Integrating over this prior yields the following predictive
updates

$p_t(y_{t+1} \mid y_{1:t}) = \mathrm{Bernoulli}\left(\frac{\alpha_t}{\alpha_t + \beta_t}\right),$

$\alpha_t = \alpha_{t-1} + y_t,$

$\beta_t = \beta_{t-1} + 1 - y_t.$

Now, suppose n = 10 observations are observed, and that seven of them are ones:
$\sum_{i=1}^{n} y_i = 7$. Figure 1 shows simulated predictive sequences 1000 steps
into the future from the prior and from the posterior. Figure 2 shows that repeating
this exercise 5000 times recapitulates the known Beta(8,4) posterior distribution
(the posterior under the uniform prior) nicely.
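
The Bernoulli forward simulation can be sketched in a few lines of Python. This is a
minimal illustration, not the code behind the figures: the function names, the seed
and the default horizon are assumptions of this sketch, and the uniform prior is
encoded as $\alpha_0 = \beta_0 = 1$, the choice consistent with the Beta(8,4)
posterior quoted above.

```python
import random

def simulate_terminal_prob(alpha, beta, steps, rng):
    """Forward-simulate `steps` Bernoulli draws from the sequential predictive
    Pr(y_{t+1} = 1) = alpha_t / (alpha_t + beta_t) and return the terminal
    predictive probability, which stands in for theta."""
    a, b = float(alpha), float(beta)
    for _ in range(steps):
        y = 1 if rng.random() < a / (a + b) else 0
        a += y          # alpha_t = alpha_{t-1} + y_t
        b += 1 - y      # beta_t  = beta_{t-1} + 1 - y_t
    return a / (a + b)

def posterior_draws(ones, zeros, steps=1000, reps=5000, seed=0):
    """Repeat the forward simulation, starting from the uniform prior
    (alpha_0 = beta_0 = 1, an assumption) updated with the observed counts."""
    rng = random.Random(seed)
    return [simulate_terminal_prob(1 + ones, 1 + zeros, steps, rng)
            for _ in range(reps)]
```

Calling `posterior_draws(ones=7, zeros=3)` returns 5000 terminal predictive
probabilities whose histogram should resemble the Beta(8,4) density, up to Monte
Carlo error and the finite 1000-step horizon.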
Figure 2. The histogram at t = 1000 for 5000 simulated posterior
predictive data sequences for n = 10, $\bar{y}_n = 0.7$; it nicely recapitulates
the known Beta(8,4) posterior distribution for $\theta$, which is shown
overlaid in black.

Example: Gaussian likelihood with known variance. Suppose $Y_t \sim N(\theta, 1)$ with
prior $\theta \sim N(\mu_0, \phi_0^{-1})$. Integrating over this prior yields the
following predictive updates

$p_t(y_{t+1} \mid y_{1:t}) = N(\mu_t, 1 + 1/\phi_t),$

$\mu_t = \frac{y_t + \mu_{t-1}\phi_{t-1}}{1 + \phi_{t-1}},$

$\phi_t = 1 + \phi_{t-1}.$

Forward simulation yields (approximate) posterior distributions over
$\theta = \bar{y}_N$, as in the Bernoulli example above and similarly recapitulates,
as expected, the usual Bayesian posterior.
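
The Gaussian case can be simulated in the same way. In the sketch below, `mu_n` and
`phi_n` stand for the predictive mean and precision after the n observed data points;
these names, the defaults and the seed are illustrative assumptions, and `phi_n` must
be positive for the first predictive variance to be finite.

```python
import math
import random

def gaussian_terminal_mean(mu_n, phi_n, steps, rng):
    """Forward-simulate from p_t = N(mu_t, 1 + 1/phi_t) and return mu_N."""
    mu, phi = float(mu_n), float(phi_n)
    for _ in range(steps):
        y = rng.gauss(mu, math.sqrt(1.0 + 1.0 / phi))  # draw y_{t+1}
        mu = (y + mu * phi) / (1.0 + phi)              # mean update
        phi = 1.0 + phi                                # precision update
    return mu

def gaussian_posterior_draws(mu_n, phi_n, steps=1000, reps=2000, seed=0):
    """Monte Carlo draws of the terminal predictive mean."""
    rng = random.Random(seed)
    return [gaussian_terminal_mean(mu_n, phi_n, steps, rng) for _ in range(reps)]
```

Under the conjugate model the draws should be approximately $N(\mu_n, 1/\phi_n)$.
At any finite horizon the simulated variance is $1/\phi_n - 1/\phi_N$, slightly
smaller than the posterior variance, and the gap vanishes as the horizon grows.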

These two examples demonstrate that an explicit likelihood-prior specification is
unnecessary for producing posterior distributions. This fact will be crucial for the
new model specification approach, which bypasses the likelihood-prior representation
altogether, working entirely in the space of predictive distributions.

2. Prior-free model specification via backward induction

It is possible to determine the sequence in (2) not by integrating a specified
likelihood over a specified prior distribution, but by iteratively solving for each
term in the product by directly enforcing (1), starting from $p_N(y)$ and working
backward. This section works through this approach on three small examples. The next
section uses the backward induction approach to derive a new method for nonparametric
density estimation.


Example: Bernoulli likelihood. Assume that for a sample of size N and
$\bar{y}_N = N^{-1}\sum_{t=1}^{N} y_t$, a sufficiently accurate predictive
distribution for $Y_{N+1}$ is Bernoulli($\bar{y}_N$). Write
$\pi_t = \Pr(Y = 1 \mid y_{1:t})$ and $\pi_t(z) = \Pr(Y = 1 \mid Y_t = z)$. Plugging
these definitions directly into (1) gives

(7)  $\pi_{N-1} = \pi_N(1)\,\pi_{N-1} + \pi_N(0)\,(1 - \pi_{N-1})$

     $= \left(\frac{N-1}{N}\bar{y}_{N-1} + \frac{1}{N}\right)\pi_{N-1} + \frac{N-1}{N}\bar{y}_{N-1}\,(1 - \pi_{N-1}),$

and solving for $\pi_{N-1}$ yields $\pi_{N-1} = \bar{y}_{N-1}$.

Repeating the same argument shows that the coherent predictive sequences use the
current sample average at time t as the prediction probability for observation t + 1.

Simulation from this sequence, as described in Section 1.3, yields a posterior
distribution over $\theta = \bar{y}_N$.

Note that to duplicate the Bayesian solution demonstrated in the previous section,
one can “seed” the backward induction procedure with two pseudo-observations, one
of which is a one and the other a zero.
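
The coherence of the sample-average predictive rule can be checked exactly with
rational arithmetic. The sketch below is illustrative only; `predictive_prob` and
`coherence_gap` are hypothetical helper names introduced here.

```python
from fractions import Fraction

def predictive_prob(ones, t):
    """Backward-induced predictive Pr(Y_{t+1} = 1 | y_{1:t}): the sample mean
    of t observations, `ones` of which equal one."""
    return Fraction(ones, t)

def coherence_gap(ones, t):
    """Return pi_t - [pi_{t+1}(1) pi_t + pi_{t+1}(0) (1 - pi_t)].
    Sequential coherence requires this to be exactly zero."""
    pi_t = predictive_prob(ones, t)
    pi_next_if_one = predictive_prob(ones + 1, t + 1)   # next draw is a one
    pi_next_if_zero = predictive_prob(ones, t + 1)      # next draw is a zero
    return pi_t - (pi_next_if_one * pi_t + pi_next_if_zero * (1 - pi_t))
```

Seeding with one pseudo-success and one pseudo-failure amounts to calling the same
functions with `ones + 1` and `t + 2`; the gap remains zero, and the rule is then
defined even before any data are observed.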
Example: Gaussian distribution with known variance. Assume that
$Y_{N+1} \sim N(\bar{y}_N, 1)$ for a large fixed N. Equivalently,
$Y_{N+1} = \bar{y}_N + \epsilon_N$ for $\epsilon_N \sim N(0,1)$, or in terms of
the random variable $Y_N$,
$Y_{N+1} = \frac{1}{N}Y_N + \frac{N-1}{N}\bar{y}_{N-1} + \epsilon_N$. Because the sum
of two Gaussians is again Gaussian, it is only necessary to find a Gaussian
distribution for $Y_N$ that satisfies the above. Therefore, solving for the mean and
variance gives

(8)  $E Y_N = \frac{N-1}{N}\bar{y}_{N-1} + \frac{1}{N}E Y_N + E\epsilon_N \;\Rightarrow\; E Y_N = \bar{y}_{N-1},$

     $V Y_N = \frac{V Y_N}{N^2} + V\epsilon_N.$


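Noting that $V Y_N = (1 - N^{-2})^{-1}\, V\epsilon_N$ defines a recursion, one can
compute the predictive variance of $Y_{t+1}$ given $y_{1:t}$ as

(9)  $V Y_{t+1} = \prod_{t+1 \le j \le N} (1 - j^{-2})^{-1},$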


for any t. How different this is from the usual Bayesian approach depends on the
value of N. With orthodox Bayes, $N \to \infty$. Figure 3 shows how the variance decays
for N = 20 versus N = 100, compared to the standard Bayesian approach in the
previous section, with $\phi_0 = 0$.
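
To make the comparison concrete, the product in (9) can be evaluated numerically. The
sketch below reads (9) as the predictive variance of $Y_{t+1}$ given $y_{1:t}$, with
the product running over $j = t+1, \dots, N$. That indexing, the telescoped closed
form $N(t+1)/(t(N+1))$ and the function names are assumptions of this example; the
code checks the product and the closed form against each other.

```python
def backward_induced_variance(t, N):
    """Product of (1 - j^{-2})^{-1} over j = t+1, ..., N (requires t >= 1)."""
    v = 1.0
    for j in range(t + 1, N + 1):
        v /= 1.0 - j ** -2
    return v

def telescoped_variance(t, N):
    """Closed form of the same product, obtained by telescoping
    j^2 / ((j - 1)(j + 1)) over j = t+1, ..., N."""
    return N * (t + 1) / (t * (N + 1))

def bayes_variance(t, phi0=0.0):
    """Conjugate Bayesian predictive variance 1 + 1/phi_t with phi_t = phi0 + t."""
    return 1.0 + 1.0 / (phi0 + t)
```

As $N \to \infty$ the closed form tends to $1 + 1/t$, the Bayesian predictive
variance with $\phi_0 = 0$; for finite N the backward-induced variance is smaller by
the factor $N/(N+1)$.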


Note that the predictive sequences arrived at by backward induction in both the
binomial and Gaussian examples correspond to improper prior distributions.
(Similarly, it will be seen that the kernel density backward induced model is patently
ill-defined for $p_0(y)$.) It is worth considering if this should be seen as troubling.
It is well-known that improper priors can lead to incoherence [Eaton and Freedman,
2004], essentially because they correspond to improper prior predictive distributions.
However, the distribution over $Y_{(n+1):N}$ is best thought of as a tool for inducing
post-data subjective uncertainty assessments. As such, if any coherence arguments
apply (see Section 4), they would pertain merely to the post-data predictive
distributions. By construction, proper joint distributions over future outcomes are
obtained and provide a proper posterior distribution over $\theta = g[p_N]$. More
interestingly, the impropriety of $p_0(y)$ is easy to remedy with the use of
“pseudo-observations” to define the one-step-ahead predictive distribution, as
suggested previously for the binomial example. Although pseudo-observations are widely
known as one way to characterize priors in the exponential family, the use of
pseudo-observations in the kernel density model proposed in the following section
would also yield a proper prior predictive distribution.

Figure 3. At N = 20 the predictive variance decays at a faster rate
than the standard Bayesian model. By N = 100, the difference in the
decay rates is nearly imperceptible. The horizontal axis of the second
panel runs only to 20, rather than 100, for better visual comparison.


Example: Bayes rule. The previous two examples admitted closed-form solutions
essentially because they are both in the natural exponential family with quadratic
variance functions [Morris, 1982]. In particular, solving for the sequential coherence
condition is possible because this family is closed under convolution of a linear
transformation. To see that sequential coherence is more general than this restrictive
case, it is instructive to see how Bayes rule implies sequential coherence. Begin with
the sequential coherence condition,

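$p_{t-1}(y) = \int p_t(y \mid x)\, p_{t-1}(x)\, dx,$

and simply substitute in the corresponding Bayesian prior and posterior predictive
distributions:

$\int f(y \mid \rho)\pi(\rho)\,d\rho = \int \left[\int f(y \mid \theta)\,\pi(\theta \mid x)\,d\theta\right] \left[\int f(x \mid \xi)\,\pi(\xi)\,d\xi\right] dx,$

$\int f(y \mid \rho)\pi(\rho)\,d\rho = \int \left[\int f(y \mid \theta)\,\frac{f(x \mid \theta)\,\pi(\theta)}{\int f(x \mid \eta)\,\pi(\eta)\,d\eta}\,d\theta\right] \left[\int f(x \mid \xi)\,\pi(\xi)\,d\xi\right] dx,$

$\int f(y \mid \rho)\pi(\rho)\,d\rho = \int \left[\int f(y \mid \theta)\,f(x \mid \theta)\,\pi(\theta)\,d\theta\right] \frac{\int f(x \mid \xi)\,\pi(\xi)\,d\xi}{\int f(x \mid \eta)\,\pi(\eta)\,d\eta}\, dx,$

$\int f(y \mid \rho)\pi(\rho)\,d\rho = \int\!\!\int f(y \mid \theta)\,f(x \mid \theta)\,\pi(\theta)\,dx\,d\theta,$

$\int f(y \mid \rho)\pi(\rho)\,d\rho = \int f(y \mid \theta)\,\pi(\theta)\,d\theta.$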


Thus, we see that if $f(\cdot \mid \cdot)$ and $\pi(\cdot)$ are the same in each term
above, we satisfy sequential coherence. What is notable about this derivation is that
$\theta$, $\xi$ and $\rho$ need not refer to the same parameters; formally, we have
made no mention of a single shared measure space. From the perspective of sequential
coherence, the prior distribution is merely a technical device for passing information
between predictive distributions in a coherent fashion.

The remainder of the paper describes a sequentially coherent model which is more
complicated than the simple Bernoulli and Gaussian examples above, but which is
not obtained by a direct application of Bayes rule.


3. A backward induced model for nonparametric density estimation

3.1. Coherent kernel density predictive distributions. In this section, the
backward induction approach is used to derive a novel method for nonparametric
density estimation with associated point-wise credible intervals. The method will be
based on $p_N(y \mid y_{1:N})$ defined in terms of a kernel density estimator
(Rosenblatt et al. [1956], Parzen, Silverman [1986]) of the form

$K^{\tau}_N(y) = \frac{1}{N}\sum_{i=1}^{N} \phi(y \mid y_i, \tau),$

where $\phi(y \mid \mu, \tau)$ is a normal density function with center $\mu$ and
“bandwidth” (variance) $\tau$.

Begin by considering the marginalization consistency criterion applied to a kernel
density estimator at sample size N:

(10)  $p_{N-1}(y) = \int K^{\tau}_N(y)\, p_{N-1}(x)\, dx.$

Now “peel off” the Nth observation $x = y_N$, obtaining

(11)  $p_{N-1}(y) = \frac{N-1}{N} K^{\tau}_{N-1}(y) + \frac{1}{N}\int \phi(y \mid x, \tau)\, p_{N-1}(x)\, dx.$


Next, substitute (11) into itself:

$\frac{N-1}{N} K^{\tau}_{N-1}(y) + \frac{1}{N}\int \phi(y \mid x, \tau) \left[\frac{N-1}{N} K^{\tau}_{N-1}(x) + \frac{1}{N}\int \phi(x \mid x', \tau)\, p_{N-1}(x')\, dx'\right] dx$

which simplifies, because convolving normal kernels adds their variances, to

$\frac{N-1}{N} K^{\tau}_{N-1}(y) + \frac{N-1}{N^2} K^{2\tau}_{N-1}(y) + \frac{1}{N^2}\int\!\!\int \phi(y \mid x, \tau)\,\phi(x \mid x', \tau)\, p_{N-1}(x')\, dx'\, dx.$

Exchanging the order of integration (and switching the names of x and x' for
notational consistency), yields

(12)  $\frac{N-1}{N} K^{\tau}_{N-1}(y) + \frac{N-1}{N^2} K^{2\tau}_{N-1}(y) + \frac{1}{N^2}\int \phi(y \mid x, 2\tau)\, p_{N-1}(x)\, dx.$

Note that the third term in this expression is like the second term in expression (11),
with $N^2$ in place of N and $2\tau$ in place of $\tau$. Therefore, repeated
substitution of (11) into the recursion gives an expanded representation of
$p_{N-1}(y)$ as

(13)  $p_{N-1}(y) = \sum_{j=1}^{\infty} \frac{N-1}{N^j}\, K^{j\tau}_{N-1}(y).$

Note that this procedure of successive substitution is a well-known technique in the
area of solving Fredholm equations. Indeed, (11) may be recognized as an inhomogeneous
Fredholm integral equation of the second kind; see Arfken [2013] for details on
other solution techniques and references to additional theory.


Here, we can leverage insights from the statistical context, by expressing (13) as
an expectation

(14)  $p_{N-1}(y) = E\, K^{Z\tau}_{N-1}(y),$

where $Z \sim \mathrm{Geometric}(p)$ on $\{1, 2, \dots\}$ for $p = \frac{N-1}{N}$.
Moreover, because each term in (13) is itself a kernel density estimator and this
representation involves only summation and convolution, we can apply the same process
to obtain a nested sum expression for each predictive distribution at any number of
steps back (N − 2, N − 3, etc.) simply by applying the mappings $N \to N - 1$ and
$\tau \to Z\tau$. Substitution and iteration yields

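(15)  $p_{N-t}(y) = \sum_{q=1}^{\infty} \cdots \sum_{k=1}^{\infty}\sum_{j=1}^{\infty} \frac{N-1}{N^{j}}\,\frac{N-2}{(N-1)^{k}} \cdots \frac{N-t}{(N-t+1)^{q}}\; K^{(jk \cdots q)\tau}_{N-t}(y).$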


Again, this can be seen as a nested expectation of independent geometric random
variables $Z_h$ with parameters $p_h = \frac{N-h}{N-h+1}$ for $h = 1, \dots, t$:

(16)  $p_{N-t}(y) = E_1 E_2 E_3 \cdots E_t\, K^{\prod_h Z_h \tau}_{N-t}(y).$

Observe that $K^{\prod_h Z_h \tau}_{N-t}(y)$ depends on the $Z_h$ variables only via
their product. Defining

(17)  $\chi_t = Z_1 \times Z_2 \times \cdots \times Z_t,$

gives

(18)  $p_{N-t}(y) = E\, K^{\chi_t \tau}_{N-t}(y),$

where the expectation is now over $\chi_t$ for t between 1 and N − n.

As a product of independent (but not identically distributed) geometric random
variables, $\chi_t$ has no readily available closed form. However, a central limit
theorem (in the log domain) suggests a reasonable log-normal approximation.

First, note that because the $Z_h$ geometric variables are independent, the product
of their expectations gives the expectation of their product. Accordingly,
$E\chi_t = \prod_{h=1}^{t} p_h^{-1}$ with $p_h = \frac{N-h}{N-h+1}$. Similarly,
$V Z_h = \frac{1-p_h}{p_h^2}$ so $E Z_h^2 = \frac{2-p_h}{p_h^2}$ and
$V\chi_t = \prod_{h=1}^{t}\frac{2-p_h}{p_h^2} - \prod_{h=1}^{t} p_h^{-2}$ by
properties of variance. Denote $E\chi_t = \eta$ and $V\chi_t = \nu$.

The log-normal approximation is improved by respecting the fact that $\chi_t \ge 1$.
To that end, consider a log-normal random variable $\zeta_t$ with mean $\eta - 1$ and
variance $\nu$, which has parameters

(19)  $\mu = 2\log(\eta - 1) - \tfrac{1}{2}\log(\nu + (\eta - 1)^2),$

      $\sigma = \sqrt{\log(1 + \nu/(\eta - 1)^2)},$

and set $\chi_t = \zeta_t + 1$.
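
The moment matching behind (19) is easy to check numerically. The following is a
minimal sketch, with the argument `b` (the number of geometric factors in the product)
and both function names introduced here as assumptions; it uses the moment formulas
$E Z_h = 1/p_h$ and $E Z_h^2 = (2 - p_h)/p_h^2$ for a geometric variable on
$\{1, 2, \dots\}$.

```python
import math

def chi_moments(N, b):
    """Mean and variance of chi = Z_1 * ... * Z_b, where the Z_h are independent
    Geometric(p_h) variables on {1, 2, ...} with p_h = (N - h) / (N - h + 1)."""
    if not 1 <= b < N:
        raise ValueError("need 1 <= b < N")
    mean, second = 1.0, 1.0
    for h in range(1, b + 1):
        p = (N - h) / (N - h + 1)
        mean *= 1.0 / p                # E Z_h = 1 / p_h
        second *= (2.0 - p) / p ** 2   # E Z_h^2 = (2 - p_h) / p_h^2
    return mean, second - mean ** 2

def shifted_lognormal_params(N, b):
    """(mu, sigma) of the log-normal matched to mean eta - 1 and variance nu."""
    eta, nu = chi_moments(N, b)
    m = eta - 1.0
    mu = 2.0 * math.log(m) - 0.5 * math.log(nu + m * m)
    sigma = math.sqrt(math.log(1.0 + nu / (m * m)))
    return mu, sigma
```

Because the factors telescope, the mean returned by `chi_moments(N, b)` equals
$N/(N-b)$, which gives a quick sanity check on any implementation.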

Note that the number of factors in the product defining $\chi_t$ becomes small as the
forward simulation step t approaches N − n, making the log-normal approximation
inaccurate. This has an easy practical solution, however, which is to define the
backward induction starting at N + a for a large enough that the log-normal central
limit approximation obtains. Then, simply define N as the termination point for the
forward simulation. Intuitively, this works because if N is thought to be large
enough, then N + a also suffices, and $p_N(y)$ and $p_{N+a}(y)$ will be
indistinguishable (by assumption).

Figures 4 and 5 illustrate the impact on the implied kernel for various values of t.

Figure 4. For N = 1000, n = 50 and $\tau = 0.04$, the implied kernel,
marginally over $\chi_t$, is shown for t = 1 (dashed), t = 400 (dotted) and
t = 950. At t = N − n = 950, the kernel is visually indistinguishable
from a Gaussian kernel with variance 0.04.

The marginal kernel densities shown in Figure 4 were computed by numerical
integration. At present, no convenient form is known for a log-normal scale mixture
of normals. Fortunately, to implement the coherent density estimation proposed here,
no evaluation of the density is required. Rather, it is only necessary to simulate from
a kernel density distribution with a log-normal mixture of normal kernels, which can
be done trivially as follows.


At step t,

(1) Select a location parameter u at random among the previous n + t − 1 data
points (of which t − 1 are simulated).

(2) Next, draw a scale parameter s from the log-normal distribution with parameters
as in (19).

(3) Finally, draw (pseudo-)observation $y^*_{n+t}$ from $N(u, \tau(s + 1))$.
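
The three steps above can be sketched as follows. This is an illustrative
implementation under stated assumptions, not the author's code: the number of
geometric factors at each step is taken to be N minus the current sample size (this
sketch's reading of (18)), $\tau$ is treated as a variance, and all function names,
the quantile rule and the defaults are hypothetical.

```python
import math
import random

def lognormal_params(N, b):
    """Shifted log-normal (mu, sigma) for a product of b geometric factors."""
    mean, second = 1.0, 1.0
    for h in range(1, b + 1):
        p = (N - h) / (N - h + 1)
        mean /= p
        second *= (2.0 - p) / p ** 2
    m, nu = mean - 1.0, second - mean ** 2
    return (2.0 * math.log(m) - 0.5 * math.log(nu + m * m),
            math.sqrt(math.log(1.0 + nu / (m * m))))

def simulate_path(data, N, tau, rng):
    """Extend `data` to length N by sequential forward simulation."""
    path = list(data)
    if not path:
        raise ValueError("need at least one observation")
    while len(path) < N:
        b = N - len(path)                    # factors remaining (assumed mapping)
        mu, sigma = lognormal_params(N, b)
        u = rng.choice(path)                 # step (1): location
        s = rng.lognormvariate(mu, sigma)    # step (2): scale
        path.append(rng.gauss(u, math.sqrt(tau * (s + 1.0))))  # step (3)
    return path

def kde(y, points, tau):
    """Gaussian kernel density estimate with variance tau, evaluated at y."""
    norm = 1.0 / (len(points) * math.sqrt(2.0 * math.pi * tau))
    return norm * sum(math.exp(-0.5 * (y - x) ** 2 / tau) for x in points)

def pointwise_interval(data, N, tau, y, reps=200, level=0.9, seed=0):
    """Crude point-wise credible interval for the density at y, from
    independent forward-simulated paths."""
    rng = random.Random(seed)
    vals = sorted(kde(y, simulate_path(data, N, tau, rng), tau)
                  for _ in range(reps))
    lo = vals[int((1 - level) / 2 * (reps - 1))]
    hi = vals[int((1 + level) / 2 * (reps - 1))]
    return lo, hi
```

Each call to `simulate_path` is independent of the others, so the loop in
`pointwise_interval` could be distributed across workers without any change to the
sampling logic.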

Note that this forward simulation process yields independent samples of the distant
future predictive $p_N(y \mid y_{1:n}, y^*_{(n+1):N})$, which may be obtained in
parallel. This computational benefit makes the backward induced kernel density model
an attractive alternative to Gaussian mixture models for density estimation, which
require Markov chain algorithms [Escobar and West, 1995, Neal, 2000].

Figure 5. The shifted log-normal mixing distribution becomes
sharper as t approaches N − n, collapsing to a near point-mass at
$\tau = 0.04$ (shown in solid black). The dashed line shows the t = 1
one-step-ahead predictive diffuse mixing density for n = 50, N = 1000.
The gray lines represent values of t between 10 and 950 in increments
of 50. The horizontal axis is the bandwidth.


It was shown in West [1991] that among location-scale kernel density estimators,
only the double-exponential (Laplace) kernel can give predictive densities satisfying
(1). This result is not in conflict with the model here, because the kernels derived
here are log-normal scale mixtures of normals, which cannot be represented as a
simple location-scale family. In the discussion section of that paper, it is
remarked that the double-exponential kernel density model does not correspond to any
exchangeable distribution, because the likelihood evaluation depends on the ordering
of the observed data. Note, however, that temporally coherent kernel density models
are nonetheless learning symmetric in the following sense.

If the ordering of the first n observations is unknown, arriving in a batch, one must
average over permutations in order to evaluate their joint likelihood:


(20)  $p(y_{1:n}) = \frac{1}{n!}\sum_{\pi \in \Pi} p_0(y_{\pi_1})\, p_1(y_{\pi_2} \mid y_{\pi_1}) \cdots p_{n-1}(y_{\pi_n} \mid y_{\pi_1:\pi_{n-1}}),$

where $\pi \in \Pi$ denotes a permutation of the indices 1 through n. However, observe
that this averaging does not impact the conditional distribution of the unobserved
future data $Y_{(n+1):N}$, so long as the observed data $y_{1:n}$ appears in each
subsequent conditional distribution symmetrically:


(21)  $p(y_{(n+1):N} \mid y_{1:n}) = \frac{\frac{1}{n!}\sum_{\pi \in \Pi} p_{1:n}(y_{\pi_{1:n}})\, p_{(n+1):N}(y_{(n+1):N} \mid y_{\pi_{1:n}})}{p(y_{1:n})}$

      $= p_{(n+1):N}(y_{(n+1):N} \mid y_{\pi_{1:n}}) = p_{(n+1):N}(y_{(n+1):N} \mid y_{1:n}).$


This implies, remarkably, that for a backward-induced model with
permutation-invariant conditional distributions, the ordering of the observed data
matters for likelihood evaluation (which requires permutation averaging), but does not
matter for posterior inference via forward simulation.


3.2. Demonstrations. 


3.2.1. Synthetic data. For this demonstration, n = 50 and n = 500 observations are
drawn from a mixture of two Gaussians with equal weights:

(22)  $p(y) = \tfrac{1}{2}\phi(y \mid 2, 4) + \tfrac{1}{2}\phi(y \mid 10, 1).$

Each data set is fit using a backward induced kernel density procedure with N = 1000
and $\tau = 0.08$. These values were elicited by inspection of simulated data from
mixtures of normals and the corresponding kernel density fit at different sample sizes
and bandwidths. The resulting point estimate and uncertainty bands are depicted
in Figures 6 and 7. As expected, the uncertainty bands of the n = 500 sample are
much tighter than those of the n = 50 sample. For comparison, the R kernel density
estimate with bandwidth selection method SJ, as described in Sheather and Jones
[1991], is also shown.

Figure 6. Data are drawn from a mixture of two Gaussians, with
n = 50. Three density estimates overlay the data histogram. Solid is
the backward induced KDE with N = 1000 and $\tau = 0.04$; dashed is
the true density; dotted is the R KDE with bandwidth selection method
SJ. One thousand draws from the posterior density are shown in gray.


Figure 7. Data are drawn from a mixture of two Gaussians, with
n = 500. Three density estimates overlay the data histogram. Solid is
the backward induced KDE with N = 1000 and $\tau = 0.08$; dashed is the
true density; dotted is the R KDE with bandwidth selection method
SJ. One thousand draws from the posterior density are shown in gray.
The uncertainty bands are much narrower with n = 500 than with
n = 50.


3.2.2. The galaxy data. The “galaxy data” have been widely used to exemplify
Bayesian and non-Bayesian density estimation techniques. The data are 82 velocity
measurements (in km/second) of galaxies obtained from an astronomical survey of
the Corona Borealis region [Roeder, 1990]. Notable Bayesian papers using these data
include Carlin and Chib [1995], Escobar and West [1995] and Bernardo [1999].

Figure 8 depicts the posterior mean for the N = 1000, $\tau = 0.04$ model, along with
one thousand posterior draws to provide visual uncertainty bands. Also depicted are
the default kernel density estimate from the R software language and a histogram.
Although the point estimate is less smooth than the default kernel density estimate,
the posterior draws reflect substantial uncertainty, covering both the default kernel
density estimate and the histogram contours.

Figure 8. The galaxy data of Roeder [1990] consists of n = 82
astronomical measurements. The posterior mean density is shown for the
N = 1000 and $\tau = 0.08$ backward induced model (solid line). The
dashed line depicts the default KDE in R.


3.3. Uncertainty reduction as $n \to \infty$. As mentioned above, the sequential
coherence property (1) entails that the sequence of predictive densities forms a
martingale sequence. Because it is well-known that kernel density estimation is
consistent, it follows directly that the posterior mean is also consistent. To study
the concentration of the posterior about this mean, one can apply the Azuma-Hoeffding
inequality. In particular, for any y,

(23)  $|p_t(y) - p_{t+1}(y)| \le \frac{\phi(0 \mid 0, \tau)}{t+2},$

which follows from the fact that the kernel density is most peaked when the bandwidth
equals $\tau$ and the kernel is Gaussian, and that density functions are always
greater than or equal to zero. Therefore, Azuma-Hoeffding gives

(24)  $\Pr\{|p_N(y) - p_n(y)| \ge \epsilon\} \le 2\exp\left(\frac{-\epsilon^2}{2c^2\sum_{i=n+1}^{N}(i+1)^{-2}}\right) = 2\exp\left(\frac{-\epsilon^2}{2c^2\left(\psi^{(1)}(n+2) - \psi^{(1)}(n+m+2)\right)}\right),$

where $\psi^{(1)}(\cdot)$ denotes the trigamma function (the first derivative of the
digamma function), $c = \phi(0 \mid 0, \tau)$ and N = n + m. Thus, the asymptotic
point-wise concentration is dictated by the growth of the difference
$\psi^{(1)}(n+2) - \psi^{(1)}(n+m+2)$ as $n \to \infty$. It is easy to check
that indeed this difference approaches zero as n grows.

Figure 9. An application of the Azuma-Hoeffding inequality to the
martingale sequence of predictive kernel densities implies shrinking
uncertainty about the posterior mean of the m-step ahead functional,
as sample size increases. This illustration depicts the concentration of
posterior mass as a progressively narrowing “uncertainty cone,” fanning
out from the one step ahead distribution, as the observed sample
size is pushed forward from n to n'. Here N = n + m and N' = n' + m
for a fixed m.
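
The bound in (24) is straightforward to evaluate. The sketch below is illustrative:
it computes the trigamma difference through the finite sum
$\sum_{k=n+2}^{n+m+1} k^{-2}$, which equals $\psi^{(1)}(n+2) - \psi^{(1)}(n+m+2)$,
treats $\tau$ as a variance so that $c = (2\pi\tau)^{-1/2}$, and caps the result at
one since it bounds a probability. The function names are assumptions.

```python
import math

def trigamma_difference(n, m):
    """psi1(n + 2) - psi1(n + m + 2), computed as the sum of k^{-2}
    for k = n + 2, ..., n + m + 1."""
    return sum(k ** -2 for k in range(n + 2, n + m + 2))

def azuma_bound(eps, n, m, tau):
    """Upper bound on Pr(|p_N(y) - p_n(y)| >= eps) with N = n + m."""
    c = 1.0 / math.sqrt(2.0 * math.pi * tau)   # phi(0 | 0, tau)
    s = trigamma_difference(n, m)
    return min(1.0, 2.0 * math.exp(-eps ** 2 / (2.0 * c ** 2 * s)))
```

Holding m fixed and increasing n shrinks the sum roughly like $m/n^2$ once n is much
larger than m, so the bound decays rapidly for any fixed $\epsilon$.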


4. Discussion

The sequential coherence condition plays the same role in the backward induction
approach as exchangeability plays in defining traditional Bayesian probability
models. In fact, an exchangeable model is always temporally coherent. However,
interesting and useful models that satisfy these conditions need not be exchangeable,
such as the kernel density model in the previous section. The choice of the
large-sample predictive density $p_N(\cdot \mid y_{1:N})$ plays the same role in the
backward induction approach as the choice of a sufficient statistic does in an
exchangeable Bayesian model.

In light of the fact that exchangeability and sequential coherence are equivalent
for infinite sequences, the approach presented in this paper might be considered a
new computational approximation to standard Bayesian modeling. However, the
new approach has several additional advantages. First, the new approach to model
construction allows direct control of where the posterior will converge, even under
misspecification. Second, the posterior mean predictive density can be accurately
approximated without resorting to Monte Carlo simulation. Third, prior information
can be readily incorporated via “pseudo-data”, even for models (like the kernel
density model shown here) outside of the exponential family. Finally, concentration
bounds are easily established as a function of sample size by applying the
Azuma-Hoeffding inequality. For these reasons, sequential coherence and backward
induction represent a promising new approach to probabilistic modeling for data
analysis.